001    /*
002     * Trie.java
003     *
004     * Copyright 2003 Sergio Anibal de Carvalho Junior
005     *
006     * This file is part of NeoBio.
007     *
008     * NeoBio is free software; you can redistribute it and/or modify it under the terms of
009     * the GNU General Public License as published by the Free Software Foundation; either
010     * version 2 of the License, or (at your option) any later version.
011     *
012     * NeoBio is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
013     * without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
014     * PURPOSE. See the GNU General Public License for more details.
015     *
016     * You should have received a copy of the GNU General Public License along with NeoBio;
017     * if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330,
018     * Boston, MA 02111-1307, USA.
019     *
020     * Proper attribution of the author as the source of the software would be appreciated.
021     *
022     * Sergio Anibal de Carvalho Junior             mailto:sergioanibaljr@users.sourceforge.net
023     * Department of Computer Science               http://www.dcs.kcl.ac.uk
024     * King's College London, UK                    http://www.kcl.ac.uk
025     *
026     * Please visit http://neobio.sourceforge.net
027     *
028     * This project was supervised by Professor Maxime Crochemore.
029     *
030     */
031    
032    package neobio.alignment;
033    
034    /**
035     * This class implements a trie, or a digital search tree. A trie is a multiway tree
036     * (each node can have multiple children) that represents a set of strings.
037     *
038     * <P>Each node contains data encapsulated in an object instance. Each edge spells out a
039     * character and each path from the root represents a string described by the characters
040     * labelling the traversed edges. Moreover, for each string represented, there is a unique
041     * path from the root.</P>
042     *
043     * <P>The trie of the following example represents the strings 'a', 'd', 'b', 'ac', 'ba',
044     * 'be', 'bd', 'bad' and 'bae'.</P>
045     *
046     * <CODE><BLOCKQUOTE><PRE>
047     *      [0]
048     *     --+--
049     *    /  |  \
050     *  a/  d|   \b
051     * [1]  [2]  [4]
052     *  |       --+--
053     *  |      /  |  \
054     * c|    a/  e|  d\
055     * [3]  [5]  [6]  [9]
056     *     --+--
057     *    /     \
058     *  d/      e\
059     * [7]       [8]
060     * </PRE></BLOCKQUOTE></CODE>
061     *
062     * <P>It is easy to see that strings with common prefixes will branch off from each other
063     * at the first distinguishing character. This feature makes the trie a good data
064     * structure to identify and represent phrases of a text such as the ones induced by the
065     * Lempel-Ziv familiy of compression algorithms. For instance, the LZ78 version parses
066     * the text into phrases, where each phrase is the longest matching phrase seen previously
067     * plus one character.</P>
068     *
069     * <P>In this implementation, each node is actually an instance of this class. To build a
070     * trie, one must first create the root using the public constructor:</P>
071     *
072     * <CODE><BLOCKQUOTE><PRE>
073     * Trie root = new Trie (some_object);
074     * </PRE></BLOCKQUOTE></CODE>
075     *
076     * <P>Here <CODE>some_object</CODE> contains any relevant information encapsulated in an
077     * object instance. Typically, that's the only moment the public constructor is used. From
078     * now on, all new nodes will be added as a new child of one existing node using the
079     * <CODE>add</CODE> method:</P>
080     *
081     * <CODE><BLOCKQUOTE><PRE>
082     * new_node = any_node.add (some_object, character);
083     * </PRE></BLOCKQUOTE></CODE>
084     *
085     * <P>Here <CODE>character</CODE> is the character that will label the edge from
086     * <CODE>any_node</CODE> to <CODE>new_node</CODE>. Note that this transition must not
087     * already exist, otherwise an exception is raised.
088     *
089     * <P>To find the longest prefix of a given string, we follow a path from the root down
090     * the tree, character by character, with the <CODE>spellDown</CODE> method:</P>
091     *
092     * <CODE><BLOCKQUOTE><PRE>
093     * next_node = root;
094     * while (next_node != null)
095     * {
096     *     current_node = next_node;
097     *     char c = get next character from somewhere
098     *     <B>next_node = current_node.spellDown (c);</B>
099     * }
100     * </PRE></BLOCKQUOTE></CODE>
101     *
102     * <P><CODE>spellDown</CODE> follows the edge out of <CODE>current_node</CODE> labelled by
103     * the character <CODE>c</CODE> and returns the next node. If there is no such a path, it
104     * returns null.</P>
105     *
106     * <P>To retrieve the information stored at any node, simply use the <CODE>getData</CODE>
107     * method.</P>
108     *
109     * <P>In fact, there are many ways to implement a trie. To avoid wasting space with
110     * multiple pointers at each node, this implementation uses an approach with a linked list
111     * of siblings. Each node actually contains a pointer to one of its children and a pointer
112     * to one of its siblings only. Together with the pointers, each node also stores the
113     * character that labels the edge to the pointed node.<P>
114     *
115     * <CODE><BLOCKQUOTE><PRE>
116     * [0]
117     *  |
118     * a|  d     b
119     * [1]---[2]---[4]
120     *  |           |
121     * c|          a|  e     d
122     * [3]         [5]---[6]---[9]
123     *              |
124     *             d|  e
125     *             [7]---[8]
126     * </PRE></BLOCKQUOTE></CODE>
127     *
128     * <P>In this way, a trie is similar to a binary tree. Although this implementation is
129     * more efficient in terms of space, the search for a label with a given character leaving
130     * a node <CODE>n</CODE> is no more constant but proportional to the number of children of
131     * <CODE>n</CODE>. In the previous example, it is necessary to traverse three edges to
132     * reach node 9 from node 4 with character d.</P>
133     *
134     * <P>This class is used by the {@linkplain FactorSequence} to build a linked list of
135     * factors of a sequence in a LZ78 fashion, i.e. where each factor is the longest factor
136     * previously seen plus one character.</P>
137     *
138     * @author Sergio A. de Carvalho Jr.
139     * @see FactorSequence
140     */
141    public class Trie
142    {
143            /**
144             * A pointer to the first of this node's children.
145             */
146            protected Trie son;
147    
148            /**
149             * The character that labels the edge from this node to the child node pointer by
150             * <CODE>son</CODE>.
151             */
152            protected char to_son;
153    
154            /**
155             * A pointer to this node's next sibling.
156             */
157            protected Trie sibling;
158    
159            /**
160             * The character that labels the edge from this node to the sibling pointer by
161             * <CODE>sibling</CODE>.
162             */
163            protected char to_sibling;
164    
165            /**
166             * This node's stored data.
167             */
168            protected Object        data;
169    
170            /**
171             * Creates a new trie node with the specified data. This constructor is typically used
172             * by the client only once to instantiate the root node. After that, all new nodes are
173             * implicitly instantiated by the <CODE>add</CODE> method.
174             *
175             * @param data the data that will be associated with the new node
176             */
177            public Trie (Object data)
178            {
179                    this.son = null;
180                    this.sibling = null;
181                    this.data = data;
182            }
183    
184            /**
185             * Returns the data associated with this node.
186             *
187             * @return data associated with this node
188             */
189            public Object getData ()
190            {
191                    return data;
192            }
193    
194            /**
195             * Adds a new child to this node. The new node will be implicitly instantiated with
196             * the <CODE>data</CODE> argument, and the edge from this node to the new node will be
197             * labelled by the character argument. If this node already have an edge labelled with
198             * this character, an exception is raised. Otherwise, the new node created and
199             * returned.
200             *
201             * <P>If this node have no child, a new node is created straight away. Otherwise, the
202             * task is assigned to its first child that will add the new node as a sibling.</P>
203             *
204             * @param data the data that will be associated with the new node
205             * @param c the character that will label the edge from this node to the new node
206             * @return the added node
207             * @throws IllegalStateException if this node already have an edge labelled by
208             * <CODE>c</CODE>
209             */
210            public Trie add (Object data, char c)
211            {
212                    if (son == null)
213                    {
214                            son = new Trie (data);
215                            to_son = c;
216                            return son;
217                    }
218                    else
219                    {
220                            if (to_son != c)
221                                    return son.addSibling (data, c);
222                            else
223                                    // duplicate char
224                                    throw new IllegalStateException ("Failed to add character " + c +
225                                                                                                                                    " already exists.");
226                    }
227            }
228    
229            /**
230             * Adds a sibling to this node. The new node will be implicitly instantiated with
231             * the <CODE>data</CODE> argument, and the edge from this node to the new node will be
232             * labelled by the character argument. If this node already have a sibling with this
233             * character, an exception is raised. Otherwise, the new node is created and returned.
234             *
235             * <P>If this node have no direct sibling, a new node is created straight away.
236             * Otherwise, the task is assigned to its next sibling.</P>
237             *
238             * @param data the data that will be associated with the new node
239             * @param c the character that will label the edge from this node to the new node
240             * @return the added node
241             * @throws IllegalStateException if this node already have an edge labelled by
242             * <CODE>c</CODE>
243             */
244            protected Trie addSibling (Object data, char c)
245            {
246                    if (sibling == null)
247                    {
248                            sibling = new Trie (data);
249                            to_sibling = c;
250                            return sibling;
251                    }
252                    else
253                    {
254                            if (to_sibling != c)
255                                    return sibling.addSibling (data, c);
256                            else
257                                    // duplicate char
258                                    throw new IllegalStateException ("Failed to add character: " + c +
259                                                                                                                                    " already exists.");
260                    }
261            }
262    
263            /**
264             * Follows a path from this node to one of its children by spelling the character
265             * supplied as an argument. If there is no such a path, <CODE>null</CODE> is returned.
266             * Otherwise, the reached child node is returned.
267             *
268             * <P>If this node's direct child is reached with an edge labelled by the character,
269             * it is returned straight away. Otherwise, it is assigned the task of finding another
270             * sibling labelled with that character.</P>
271             *
272             * @param c the character that labels the path to be followed to this node's child
273             * @return the child node reached by traversing the edge labelled by <CODE>c</CODE>
274             */
275            public Trie spellDown (char c)
276            {
277                    if (son == null) return null;
278    
279                    if (to_son == c)
280                            return son;
281                    else
282                            return son.spellRight(c);
283            }
284    
285            /**
286             * Follows a path from this node to one of its sibling by spelling the character
287             * supplied as an argument. If there is no such a path, <CODE>null</CODE> is returned.
288             * Otherwise, the reached sibling node is returned.
289             *
290             * <P>If this node's direct sibling is reached with an edge labelled by the character,
291             * it is returned straight away. Otherwise, it is assigned the task of finding another
292             * sibling labelled with that character.</P>
293             *
294             * @param c the character that labels the path to be followed to the sibling
295             * @return the sibling node reached by traversing the edge labelled by <CODE>c</CODE>
296             */
297            protected Trie spellRight (char c)
298            {
299                    if (sibling == null) return null;
300    
301                    if (to_sibling == c)
302                            return sibling;
303                    else
304                            return sibling.spellRight(c);
305            }
306    }